Collocation Extraction using Parallel Corpus
نویسندگان
چکیده
This paper presents a novel method to extract the collocations of the Persian language using a parallel corpus. The method is applicable having a parallel corpus between a target language and any other high-resource one. Without the need for an accurate parser for the target side, it aims to parse the sentences to capture long distance collocations and to generate more precise results. A training data built by bootstrapping is also used to rank the candidates with a log-linear model. The method improves the precision and recall of collocation extraction by 5 and 3 percent respectively in comparison with the window-based statistical method in terms of being a Persian multi-word expression.
منابع مشابه
Collocation Translation Acquisition Using Monolingual Corpora
Collocation translation is important for machine translation and many other NLP tasks. Unlike previous methods using bilingual parallel corpora, this paper presents a new method for acquiring collocation translations by making use of monolingual corpora and linguistic knowledge. First, dependency triples are extracted from Chinese and English corpora with dependency parsers. Then, a dependency ...
متن کاملBilingual Collocation Extraction Based on Syntactic and Statistical Analyses
In this paper, we describe an algorithm that employs syntactic and statistical analysis to extract bilingual collocations from a parallel corpus. The preferred syntactic patterns are obtained from idioms and collocations in a machine-readable dictionary. Phrases matching the patterns are extract from aligned sentences in a parallel corpus. Those phrases are subsequently matched up via cross-lin...
متن کاملCollocation Extraction Using Monolingual Word Alignment Method
Statistical bilingual word alignment has been well studied in the context of machine translation. This paper adapts the bilingual word alignment algorithm to monolingual scenario to extract collocations from monolingual corpus. The monolingual corpus is first replicated to generate a parallel corpus, where each sentence pair consists of two identical sentences in the same language. Then the mon...
متن کاملFipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora
We introduce FipsCoView, an on-line interface for dictionary-like visualisation of collocations detected from parallel corpora using a syntactically-informed extraction method.
متن کاملInducing Discourse Connectives from Parallel Texts
Discourse connectives (e.g. however, because) are terms that explicitly express discourse relations in a coherent text. While a list of discourse connectives is useful for both theoretical and empirical research on discourse relations, few languages currently possess such a resource. In this article, we propose a new method that exploits parallel corpora and collocation extraction techniques to...
متن کامل